Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new gene-model module for refFlat, GTF, and GFF3 loading and block classification, wires a new 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
tests/test_rna.rs (1)
247-247: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the shared float assertion macro in these tests.
These ad-hoc
abs() < epschecks are inconsistent with the test suite’s standard helper and give worse failure output. Replace them withassert_float_eq!(a, b, eps). As per coding guidelines,tests/**/*.rs: "Useassert_float_eq!(a, b, eps)macro for float comparisons in tests".Also applies to: 335-339, 411-424, 438-439, 455-455
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_rna.rs` at line 247, Replace the ad-hoc float comparison in the RNA tests with the shared float assertion macro. In the affected test cases around the frac_ribosomal_bases and other float checks in tests/test_rna.rs, use assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests follow the suite’s standard pattern and produce consistent failure output.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/commands/common.rs`:
- Around line 22-25: Update the directory resolution logic in `common::` so root
prefixes like `/` or Windows drive roots are treated as the filesystem root
instead of falling back to `.`. The current `prefix.parent()` handling in the
`dir` selection should preserve root paths when `Path::parent()` returns `None`
or an empty parent. Add a regression test around the startup check path to cover
a root prefix case and verify it does not incorrectly pass as the current
directory.
In `@src/commands/rna.rs`:
- Around line 671-678: The top-1000 selection in the RNA command can include
more than 1000 transcripts when multiple loci share the cutoff mean, so tighten
the filtering in the selection logic around best_per_locus/means to
deterministically keep exactly 1000 entries. Update the threshold-based check
and the later loop that builds the retained transcript set so ties at the
boundary are broken in a stable way or truncated after sorting, ensuring n_tx
cannot exceed 1000 and the downstream bias/CV medians reflect the intended
top-1000 behavior.
- Around line 1230-1240: Update mate_alignment_end to follow the command’s
fragment-end contract: when mate_cigar/mate_alignment_blocks cannot derive the
mate end from MC, fall back to TLEN before using the read length. Locate the
logic in mate_alignment_end and replace the current sequence-length-based
fallback with TLEN-derived end coordinates, keeping the existing MC path intact
so pair-span bounds used by ribosomal classification and strand-template
inference remain correct.
In `@src/gene_model/gff.rs`:
- Around line 109-114: The tx_order loop in gff.rs is still indexing features
with features[&id], which will panic when an exon/CDS references an orphan
Parent ID that never created a feature record. Update this lookup in the parsing
path to use a safe get-based check in the same block that already handles
exons_by_parent, and either return a parse error or skip the orphan explicitly.
Keep the rest of the transcript/gene resolution logic intact by guarding the
feature access before using feature.parents.first() and the subsequent gene
lookup.
In `@src/gene_model/mod.rs`:
- Around line 447-449: `detect_format` is currently swallowing read/sniffing
failures by using `line.ok()?`, which turns I/O or decompression problems into
an unknown format result. Update `detect_format` in `gene_model::mod` to return
`Result<Option<GeneModelFormat>>` instead of `Option`, and propagate the
`BufRead::lines()` read error upward rather than converting it to `None`. Make
sure callers that rely on `detect_format` handle the error case explicitly,
while preserving the existing format-detection logic for successfully read
lines.
- Around line 514-518: The transcript clustering logic in the
`clusters.iter_mut().find(...)` block is comparing `other.contig` and
`tx.contig` as raw strings, which can split the same locus when contig aliases
differ. Update the comparison to use the resolved BAM contig identity instead of
the original names, reusing the existing contig-resolution logic used elsewhere
in `src/gene_model/mod.rs` before calling `spans_overlap`. Make sure transcripts
like `1`/`chr1` or `MT`/`chrM` are grouped into the same cluster when they map
to the same underlying contig.
In `@src/plotting.rs`:
- Around line 129-130: The early return in plotting logic leaves an old PDF
behind when no series qualify, so update the empty-case handling in the plotting
routine in `src/plotting.rs` to clear any previously generated output at `path`
before returning. Make the fix in the same branch that checks `plots.is_empty()`
so stale histogram files are removed whenever the current run produces no plot.
---
Nitpick comments:
In `@tests/test_rna.rs`:
- Line 247: Replace the ad-hoc float comparison in the RNA tests with the shared
float assertion macro. In the affected test cases around the
frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 185780c6-cb77-421d-986b-79d5f06a8684
📒 Files selected for processing (25)
CHANGELOG.mdREADME.mdsrc/commands/alignment.rssrc/commands/basic.rssrc/commands/common.rssrc/commands/docs.rssrc/commands/error.rssrc/commands/gcbias.rssrc/commands/hybcap.rssrc/commands/isize.rssrc/commands/mod.rssrc/commands/multi.rssrc/commands/rna.rssrc/commands/wgs.rssrc/gene_model/gff.rssrc/gene_model/gtf.rssrc/gene_model/mod.rssrc/gene_model/refflat.rssrc/intervals.rssrc/lib.rssrc/main.rssrc/plotting.rstests/helpers/mod.rstests/test_multi.rstests/test_rna.rs
…t prefix) Three fixes from the PR #40 review: - mate_alignment_end: fall back to the TLEN-derived fragment end (min(read_start, mate_start) + |TLEN| - 1, exactly Picard's CoordMath.getEnd(fragmentStart, abs(insertSize))) when the MC tag is absent, instead of the read's own length. This matches the documented contract (the module doc and README both say 'falling back to TLEN, never read length') and the values fed to the ribosomal and strand-template fragment-span checks. Only affects pairs lacking MC; output on MC-tagged data is byte-identical. Extracted tlen_fragment_end as a pure, unit-tested helper. - detect_format: return Result<Option<GeneModelFormat>> and propagate line read / decompression errors instead of collapsing them to 'could not detect format' via line.ok()?. The gzip-without-extension hint now runs up front so that case still gets a targeted message. - validate_output_prefix: treat a root prefix ('/', 'C:\') as the directory to check rather than falling back to '.'. Declined (with rationale): tie-handling >1000 transcripts (f64 means don't tie in practice), gff features[&id] 'panic' (false positive — every tx_order id is inserted into features), cluster-on-resolved-contigs (real models use one contig convention per file), and remove-stale-PDF (deleting a user's file to avoid a rare stale-output case is the wrong trade-off). 619 tests pass; output byte-identical on the MC dataset.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
ERRATA.md (2)
228-230: 📐 Maintainability & Code Quality | 🔵 TrivialAdd language specifier to fenced code block.
- ``` + ```text TIN = 100 · e^H / n, where H = −Σ pᵢ·ln pᵢ, pᵢ = cᵢ / Σⱼ cⱼ<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@ERRATA.mdaround lines 228 - 230, Add a language specifier to the fenced
code block in the TIN formula snippet so the markdown renders as a text block;
update the existing fence around the equation to use a text code fence while
keeping the formula content unchanged.</details> <!-- cr-comment:v1:4267d3d4f914df7a3731cdf0 --> _Source: Linters/SAST tools_ --- `237-239`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_ **Add language specifier to fenced code block.** ```diff - ``` + ```text TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) )<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@ERRATA.mdaround lines 237 - 239, Add a language specifier to the fenced
code block in the ERRATA documentation snippet so it is marked as text (for
example, the block containing TIN = 100 · exp( −D_KL(normalized coverage ‖
uniform) ) should use a text fence). Update the markdown fence itself and keep
the equation content unchanged.</details> <!-- cr-comment:v1:d74abdec73475219503a0387 --> _Source: Linters/SAST tools_ </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Inline comments:
In@tests/test_rna.rs:
- Around line 653-654: The strand-count check in the RNA test is too loose and
no longer detects double-counting; update the assertion on m.r1_tx_strand_reads
to require the exact expected value using the existing metric contract in this
test, and keep the companion m.r2_tx_strand_reads assertion unchanged. Use the m
variable in the test case with 10 FR pairs so the assertion verifies the precise
strand count rather than a minimum threshold.
Nitpick comments:
In@ERRATA.md:
- Around line 228-230: Add a language specifier to the fenced code block in the
TIN formula snippet so the markdown renders as a text block; update the existing
fence around the equation to use a text code fence while keeping the formula
content unchanged.- Around line 237-239: Add a language specifier to the fenced code block in the
ERRATA documentation snippet so it is marked as text (for example, the block
containing TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) ) should use a
text fence). Update the markdown fence itself and keep the equation content
unchanged.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Pro **Run ID**: `7d97d606-38d7-4417-8a19-db82feb1d955` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between d425634d41ea6a16065f1f06723db39786ed6382 and 37e9993e504b51f396d1dc5573c782eb14082a7c. </details> <details> <summary>📒 Files selected for processing (8)</summary> * `CHANGELOG.md` * `ERRATA.md` * `README.md` * `src/commands/docs.rs` * `src/commands/rna.rs` * `src/gene_model/mod.rs` * `src/gene_model/refflat.rs` * `tests/test_rna.rs` </details> <details> <summary>✅ Files skipped from review due to trivial changes (1)</summary> * CHANGELOG.md </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary> * src/commands/docs.rs * src/gene_model/mod.rs * src/commands/rna.rs </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
CodeRabbit (PR #40): the auto_detects_forward_strandedness test builds exactly 10 FR pairs, so r1_tx_strand_reads is deterministically 10. Use assert_eq!(.,10) instead of >= 10 so the test catches template double-counting.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/rna-integrity.md (1)
26-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd language specifier to fenced code block.
The formula block at Line 26 lacks a language tag, triggering
markdownlint-cli2MD040. Addtextormathto silence the lint and improve renderer consistency.- ``` + ```text TIN = 100 · e^H / n, H = −Σ pᵢ·ln pᵢ, pᵢ = cᵢ / Σⱼ cⱼ = 100 · exp( −D_KL(coverage ‖ uniform) )<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@docs/rna-integrity.mdaround lines 26 - 29, The fenced formula block in the
RNA integrity doc is missing a language specifier, which triggers markdownlint
MD040. Update the fenced block around the TIN formula in the markdown so it uses
a language tag such as text or math, keeping the existing content unchanged to
satisfy the linter and improve rendering consistency.</details> <!-- cr-comment:v1:a75e9f760e29d064033f042a --> _Source: Linters/SAST tools_ </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Nitpick comments:
In@docs/rna-integrity.md:
- Around line 26-29: The fenced formula block in the RNA integrity doc is
missing a language specifier, which triggers markdownlint MD040. Update the
fenced block around the TIN formula in the markdown so it uses a language tag
such as text or math, keeping the existing content unchanged to satisfy the
linter and improve rendering consistency.</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Pro **Run ID**: `08c3800f-4c63-4fe7-a1b1-ef843981e7dd` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between a22cec3e6a11eb627511a5a63108b693efa84912 and a3ff63523bb514924f398be200c7d27a9f61d8a1. </details> <details> <summary>⛔ Files ignored due to path filters (2)</summary> * `docs/images/rna-integrity-rin-vs-tin.png` is excluded by `!**/*.png` * `docs/images/rna-integrity-tin-vs-depth.png` is excluded by `!**/*.png` </details> <details> <summary>📒 Files selected for processing (7)</summary> * `ERRATA.md` * `README.md` * `docs/rna-integrity.md` * `src/commands/docs.rs` * `src/commands/rna.rs` * `src/plotting.rs` * `tests/test_rna.rs` </details> <details> <summary>✅ Files skipped from review due to trivial changes (2)</summary> * README.md * ERRATA.md </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary> * src/commands/docs.rs * src/plotting.rs * tests/test_rna.rs </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Ports CollectRnaSeqMetrics + transcript insert size from Picard/fgbio into a new `riker rna` subcommand, with supporting gene-model parsing and plotting. - Gene models: GTF, GFF (RefSeq NC_* accession mapping), and refFlat parsers under src/gene_model/, streamed via fgoxide rather than slurped to a String. - Hot path: per-locus exon-union coverage with a coordinate-sweep locus lookup and enclosure-based insert size; linear-time per-base classification. - Metrics: read origin, gene detection, splice junctions, TIN, biotype output, expanded QC plots, and a TLEN insert-size fallback. TIN validated on GENCODE v50. - Integrates with the multi command (CollectorKind::Rna) and the docs registry. - Startup hardening: validate output writability at startup (fail-fast). - Docs: rna tool docs plus intentional differences vs Picard/fgbio split into ERRATA.md and docs/rna-integrity.md. Squashed from 20 commits on tf_rna (preserved in tf_rna_pre_rebase); the underlying feature is a single new command.
Correctness / robustness: - GTF/GFF3: fold the `stop_codon` feature into the coding extent so cds_end includes the stop codon, matching Picard and the refFlat path. GENCODE emits stop_codon as a separate feature that CDS excludes; RefSeq/refFlat already include it. - refFlat parser: enforce txStart<txEnd and exon/CDS within the transcript span, so the unchecked u32 subtractions in count_blocks/build_loci can't silently wrap on a malformed file — reject it instead. - rna: saturating_mul on the `2 * end_bias_bases` short-transcript gate so a pathological `--end-bias-bases` can't wrap it and panic the coverage slices. - gene-model format auto-detect: choose GTF vs GFF3 by which key/value separator appears first, so a GTF value containing '=' isn't misread as GFF3. - gene-model parse errors now carry the file path and detected format. Cleanup / placement: - Move overlap_inclusive/encloses_inclusive out of gene_model (rna is the only caller) into rna.rs and drop them from the public API. - Move the MateSpan helper struct out of the module-function section. - intervals::read_file_contents back to private, and fix its stale doc: it is not shared with the gene-model reader (which streams via fgoxide). - Wrap the rna `--help` long_about to house style; mark the unused FG_PINE brand colour `#[allow(dead_code)]`; fix the module doc that overstated what `--exclude-duplicates` drops. - CHANGELOG: correct the rna output-file list (drop the non-existent .rna-coverage.txt; add the biotype/gene-expression charts). Tests: rna-through-multi integration test; stop_codon extension (GTF + GFF3); refFlat span validation; contig reconciliation; format auto-detect '='. Deferred to follow-ups (#20, #21): reduce rna per-read CIGAR allocation; evaluate the noodles gtf/gff parsers. Skipped as theoretical: the split-MC insert-size dedup edge case.
GFF3 percent-encodes reserved characters (; = & , % tab newline) inside attribute values; our hand-rolled parser left them encoded, so a gene name or Name containing an encoded comma/semicolon came through mangled. Decode values at extraction (gene_name / Name / gene_id / transcript_id / ID, biotype, Parent, and the RefSeq region accession->name mapping). Parent is split on the raw ',' separator first, then decoded, so an encoded comma (%2C) inside one id is not treated as a separator. percent-encoding was already in the tree transitively; promoted to a direct dep. This closes out the noodles-gtf/gff parser evaluation. A full switch to noodles was ~3.5x slower on the gene-model parse (8.4s vs 2.4s on GENCODE v29 GTF, since noodles builds a full percent-decoded attribute map per line) and left all the source-specific grouping / biotype-key / RefSeq-accession logic ours -- so percent-decoding, the one real gap noodles would have closed, is added directly to the hand-rolled parser instead.
bases_overlapping_exons was the single largest self-time item in the --threads 3 hot-path profile (~6% self). It summed the overlap of every alignment block against every exon of a transcript — an O(blocks×exons) product — even though a read's blocks and a transcript's exons are both sorted by start and internally disjoint, so each block can only overlap a short contiguous run of exons. Replace the nested product with the same partition_point + break merge-walk already used by overlap_with_intervals: for each block, binary-search the first exon that can overlap and scan only until an exon starts past the block. O(blocks + exons) with early termination. Output is byte-identical on the 80M-read ENCODE BAM (all four rna TSVs). Instructions retired drop ~1.5% (~600.4B -> ~591.2B, consistent across runs); wall time on this laptop is thermal-noise-dominated and can't resolve a change this size.
classify_bases and tally_insert_size each called rec_alignment_blocks on the same record independently, so every paired read walked and allocated its CIGAR alignment blocks twice. Parse them once at the top of accept() into a local Vec and pass &[(u32,u32)] to both; classify_bases no longer needs the record at all. The block Vec is deliberately a local passed by reference, not a field on the collector: a self-owned scratch buffer read in a loop that also mutates other self fields forces aliasing-conservative reloads (measured ~3.5% regression previously), whereas a provably non-aliasing local does not. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~2.4% (~590.2B -> ~575.8B, consistent across runs).
The MC (mate CIGAR) tag was parsed up to three times per read: once in
classify_ribosomal and once in tally_strand (both via mate_alignment_end,
each of which copied the tag bytes and re-walked the CIGAR only to take
its reference end) and a third time in resolve_mate_span for insert size.
For a paired RNA read that path meant ~3 tag copies + 3 CIGAR parses +
their allocations, every read.
Parse it once at the top of accept() into a MateAlign { Absent, Malformed,
Parsed { blocks, end } }, computed only for same-contig mapped pairs (the
only reads any consumer uses it for), and thread &MateAlign through the
three callers:
- mate_alignment_end -> mate_end_or_tlen(record, mate_start, &MateAlign):
reads the precomputed end, same TLEN fallback for Absent/Malformed.
- resolve_mate_span borrows the parsed blocks via Cow (owned only for the
rare MC-absent TLEN estimate), so MateSpan no longer allocates on the
exact path.
Absent conflates 'not a same-contig pair' with 'no MC tag', which is
sound: every consumer that could see the former already returns before
using the value (different-contig strand reads have proper=false, so the
mate end is unused). Malformed maps to the same skip the old produced.
Output byte-identical on the 80M-read ENCODE BAM. Instructions retired
drop ~5.3% (~573.9B -> ~543.5B) and wall time ~29.2s -> ~27.3s this run;
cumulative with the two prior commits ~9.5% fewer instructions vs the
start of this pass.
rec_alignment_blocks, mate_alignment_blocks, and rec_introns each returned a freshly heap-allocated Vec per read; on the 80M-read BAM that is on the order of 150M+ small allocations, and rpmalloc alloc/dealloc was ~8% of the compute thread's self time in the profile. Return a SmallVec<[(u32,u32); 4]> (aliased CigarBlocks) instead: a read's M/=/X blocks (unspliced 1, one junction 2, ...) and its introns fit inline in the common case, so the parse never touches the heap; only reads spanning more than four blocks spill. Every consumer already took a slice (&[(u32,u32)]) or borrows via MateSpan's Cow, so nothing downstream changed. The MateAlign::Parsed blocks field carries the same SmallVec. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~3.75% (~544.0B -> ~523.6B) and wall time ~27.6s -> ~26.6s; cumulative over this pass ~12.8% fewer instructions vs the start.
accept() allocated a fresh Vec<u32> for the gene loci overlapping each read. Most reads fall in zero or one locus, so this was ~one small heap allocation per read for a 1-2 element vector. Return the sweep results into a SmallVec<[u32; 4]> (aliased LociVec) instead: zero/one/few-locus reads stay inline and never touch the heap; only reads over more than four overlapping genes spill. overlapping_loci still clear()s and extend()s it, and every consumer already took &[u32], so nothing else changed. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~1.4% (~523.6B -> ~516.4B); cumulative over this pass ~14% fewer instructions vs the start.
parse_mate_align went through mate_cigar, which did record.aux_tag(MC)? .as_str().map(|s| s.to_vec()): aux_tag already hands back an owned AuxValue (the aux store is cloned out), and .to_vec() then allocated and memcpy'd a *second* copy of the same bytes for every paired read carrying an MC tag. Inline mate_cigar into parse_mate_align and move the Vec out of the owned AuxValue::String via a let-else, so there's exactly one allocation (the one aux_tag already made) and no extra memcpy. A missing or non-string MC tag maps to Absent exactly as the old Option path did. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~1.7% (~516.6B -> ~507.9B); cumulative over this pass ~15.4% fewer instructions vs the start.
On the multi-locus classification path, classify_blocks unioned the overlapping genes' exon and coding intervals by concatenating them and calling merge_intervals, which sort_unstable()s the whole set. But each locus's exon_union / coding_union is already sorted and disjoint by construction, so the sort is redundant work — and quicksort was ~3% of the compute thread's self time in the profile, almost all of it from this per-read path. Add union_of_sorted_disjoint (a pairwise merge of pre-sorted sets) and use it for the coding and exon unions; the per-locus spans stay on merge_intervals since there are only a few and they aren't pre-sorted as a set. Factor the coalescing step both share into push_coalesced. The union is the same set regardless of merge order, so output is unchanged; added unit tests pin the k-way result to merge_intervals' and cover overlap, adjacency, dedup, and empty/single-set inputs. Output byte-identical on the 80M-read ENCODE BAM. Instructions retired drop ~3.5% (~507.8B -> ~490.1B) and wall time ~26.2s -> ~24.8s; cumulative over this pass ~18% fewer instructions vs the start.
Code review found that bases_overlapping_exons's merge-walk (binary-search + early-exit) silently assumed a transcript's exons are pairwise disjoint — a precondition the parsers never enforce (they only sort). A malformed or duplicate exon record for one transcript (from hand-edited or merged annotation files) breaks the partition_point's monotonic-end precondition, which could undercount to zero or, in debug builds, panic on u32 underflow. Make the invariant real: normalize_exons() sorts each transcript's exons and coalesces any that overlap, so the exon list handed to every downstream interval walk is sorted and pairwise disjoint. On real annotations (whose per-transcript exons never overlap) this only sorts — a no-op that keeps output byte-identical on the 80M-read ENCODE BAM. It does not change the locus exon_union at all (merge_intervals already coalesced the same intervals), so gene-level coverage is untouched; the only effect is that a malformed overlapping-exon transcript now counts each exonic base once (sane deduplicated per-transcript coverage) instead of double-counting. Keeps the merge-walk's full speed (the ~2.3% it's worth); normalize is a one-time load cost (~0.3%).
…d-MC coverage Follow-ups from the perf-pass review, no behavior change: - Delete a stale doc line left over from the removed mate_cigar fn, which had glued itself onto parse_mate_align's doc block. - Move the smallvec import into the third-party group (was appended after the crate:: group). - Rename mate_alignment_end_includes_trailing_skip -> _blocks_ (it exercises mate_alignment_blocks; the old function name no longer exists). - Point the bases_overlapping_exons doc at the newly-enforced disjoint-exon invariant. - Add insert_size_skips_pair_with_malformed_mc: pins that a present-but- unparseable MC tag (MateAlign::Malformed) skips the pair rather than falling through to the TLEN estimate.
Summary
Adds the
rnacommand: a single-pass, single-gene-model-load RNA-seq QC tool. It started as a port of PicardCollectRnaSeqMetrics(base composition, strand specificity, transcript 5'/3' coverage bias and CV) + fgbioEstimateRnaSeqInsertSize(transcript-space insert size, introns collapsed), and has grown into a superset that also covers the RSeQC / RNA-SeQC / Qualimap families: read accounting and genomic origin, gene detection, splice-junction annotation, transcript integrity (TIN), and per-biotype read counts. The collector is wired intomulti.Outputs (house conventions — lowercase headers,
frac_, no metadata lines):out.rna-metrics.txt(one wide row),out.rna-biotype.txt(+.pdf),out.rna-gene-expression.pdf,out.rna-coverage.pdf,out.rna-insert-size.txt/-histogram.txt(+.pdf).Metrics (one
rna-metrics.txtrow)assigned_readsand fractions.genes_detected(genes with ≥--genes-detected-min-reads).frac_known_juncs.frac_mrna_bases,frac_usable_bases.Highlights
chradd/strip,MT↔chrM, RefSeqNC_*→ common name). Gzip handled transparently.--strand auto, default) from observed R1/R2 transcription-strand counts;unstranded/forward/reverseforce a model.--ribosomal-intervalsfile (BED or IntervalList).MC(mate CIGAR) tag when present (exact, any orientation); whenMCis absent it estimates FR pairs from a spec-compliantTLENat the forward read rather than dropping them — see below.isizehelper so the two match).TIN scale & validation
riker's TIN is the largest deliberate numeric difference from the reference tools: it scores one well-covered transcript per gene, so its median runs ~14–16 points higher than RSeQC/RustQC, which score every isoform. The two report the same shape on a different scale. This is validated in
docs/rna-integrity.mdon a controlled RNA-degradation ladder (Sigurgeirsson et al. 2014, RIN 10→2) plus a depth-downsampling sweep: both implementations track degradation in lockstep, and riker is markedly more depth-robust (≈0.5% vs ≈4% TIN change across a 5× depth reduction). Indexing and alignment were both done with rustar-aligner against GENCODE v50, for a single-tool story.Intentional differences vs. Picard/fgbio (documented in ERRATA.md)
Deliberate corrections/choices, not regressions; on identical inputs read/base accounting matches Picard to the base and composition agrees within ~0.5%:
2×end_bias_bases) excluded from the CV/bias set; every aligned base counted (Picard drops the last base of each block); non-overlapping normalized-position bins.Math.max); fragment end from theMCtag, falling back toTLEN, never read length.MCthe mate's exact blocks are used (any orientation). Without it, FR pairs are kept by deriving the mate's right end fromstart + |TLEN| - 1at the forward read: the 5'-to-5' transcript-space size is then exact, and the unknown mate splicing is gated by requiring both mate ends in an exon plus a ≥95% collapsed-exonic footprint. Validated against the MC path on a real BAM (9.7M FR pairs; mean/median agree within ~0.5%). A BAM with neitherMCnor a validTLENyields no/incorrect insert sizes —samtools fixmate -mremains the way to guarantee exactness.@HD SO:coordinate); validated at startup, enabling the allocation-free locus sweep.Dropped Picard's per-position
*.rna-coverage.txttable — Picard emits it only because its plotting engine is external; riker plots internally, so only the.pdfis kept.Performance & memory
Profiled and tuned after the correctness-complete implementation (composition + coverage verified byte-identical at every step on a 75.2M-read PE BAM, GENCODE v29):
The metrics-expansion (read origin, junctions, biotype, TIN) added ~1.5% runtime. riker stays ~10× faster single-threaded than RustQC and far lighter. BGZF decompression (~11%) is the remaining single-threaded floor.
Testing & review
fmt/clippy --all-targets(pedantic) /nextestall clean.CollectRnaSeqMetrics(read accounting exact; composition ~0.5%; coverage/bias ~1.3%, in the corrected direction), cross-checked vs. RustQC (junctions, TIN, biotype), and the TIN dose-response validated end-to-end on real degradation-ladder data.Out of scope (tracked separately)
A curated GRCh38 rDNA BED / rRNA interval list with methodology, plus per-biotype/expression refinements — to be their own changes.